autodiscovery: advanced auto-config discovery via Python discover() bridge#50199
Go Package Import Differences
Baseline: 80e785f
Files inventory check summary
File checks results against ancestor 80e785f4:
Results for datadog-agent_7.80.0~devel.git.470.15e6784.pipeline.111287371-1_amd64.deb: No change detected
Static quality checks
✅ Please find below the results from static quality gates
Successful checks
Info
9 successful checks with minimal change (< 2 KiB)
On-wire sizes (compressed)
Regression Detector
Regression Detector Results
Metrics dashboard
Baseline: 80e785f
Optimization Goals: ✅ No significant changes detected
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | docker_containers_cpu | % cpu utilization | -0.20 | [-3.13, +2.73] | 1 | Logs |
Fine details of change detection per experiment
| perf | experiment | goal | Δ mean % | Δ mean % CI | trials | links |
|---|---|---|---|---|---|---|
| ➖ | quality_gate_logs | % cpu utilization | +0.70 | [-0.28, +1.68] | 1 | Logs bounds checks dashboard |
| ➖ | tcp_syslog_to_blackhole | ingress throughput | +0.65 | [+0.47, +0.83] | 1 | Logs |
| ➖ | otlp_ingest_logs | memory utilization | +0.47 | [+0.37, +0.57] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulative | memory utilization | +0.27 | [+0.11, +0.43] | 1 | Logs |
| ➖ | ddot_metrics_sum_delta | memory utilization | +0.14 | [-0.05, +0.33] | 1 | Logs |
| ➖ | file_to_blackhole_0ms_latency | egress throughput | +0.03 | [-0.50, +0.56] | 1 | Logs |
| ➖ | docker_containers_memory | memory utilization | +0.02 | [-0.08, +0.12] | 1 | Logs |
| ➖ | file_to_blackhole_1000ms_latency | egress throughput | +0.01 | [-0.41, +0.44] | 1 | Logs |
| ➖ | file_to_blackhole_500ms_latency | egress throughput | +0.01 | [-0.40, +0.41] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api | ingress throughput | -0.00 | [-0.20, +0.19] | 1 | Logs |
| ➖ | uds_dogstatsd_to_api_v3 | ingress throughput | -0.01 | [-0.20, +0.19] | 1 | Logs |
| ➖ | tcp_dd_logs_filter_exclude | ingress throughput | -0.01 | [-0.10, +0.09] | 1 | Logs |
| ➖ | file_to_blackhole_100ms_latency | egress throughput | -0.02 | [-0.16, +0.11] | 1 | Logs |
| ➖ | quality_gate_idle | memory utilization | -0.06 | [-0.11, -0.02] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_metrics | memory utilization | -0.10 | [-0.30, +0.09] | 1 | Logs |
| ➖ | otlp_ingest_metrics | memory utilization | -0.10 | [-0.27, +0.06] | 1 | Logs |
| ➖ | ddot_metrics_sum_cumulativetodelta_exporter | memory utilization | -0.17 | [-0.41, +0.06] | 1 | Logs |
| ➖ | uds_dogstatsd_20mb_12k_contexts_20_senders | memory utilization | -0.18 | [-0.23, -0.13] | 1 | Logs |
| ➖ | docker_containers_cpu | % cpu utilization | -0.20 | [-3.13, +2.73] | 1 | Logs |
| ➖ | quality_gate_idle_all_features | memory utilization | -0.34 | [-0.38, -0.30] | 1 | Logs bounds checks dashboard |
| ➖ | ddot_logs | memory utilization | -0.76 | [-0.82, -0.70] | 1 | Logs |
| ➖ | file_tree | memory utilization | -0.78 | [-0.83, -0.74] | 1 | Logs |
| ➖ | quality_gate_metrics_logs | memory utilization | -1.23 | [-1.47, -0.98] | 1 | Logs bounds checks dashboard |
Bounds Checks: ✅ Passed
| perf | experiment | bounds_check_name | replicates_passed | observed_value | links |
|---|---|---|---|---|---|
| ✅ | docker_containers_cpu | simple_check_run | 10/10 | 681 ≥ 26 | |
| ✅ | docker_containers_memory | memory_usage | 10/10 | 244.51MiB ≤ 370MiB | |
| ✅ | docker_containers_memory | simple_check_run | 10/10 | 715 ≥ 26 | |
| ✅ | file_to_blackhole_0ms_latency | memory_usage | 10/10 | 0.16GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_0ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_1000ms_latency | memory_usage | 10/10 | 0.21GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_1000ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_100ms_latency | memory_usage | 10/10 | 0.17GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_100ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | file_to_blackhole_500ms_latency | memory_usage | 10/10 | 0.19GiB ≤ 1.20GiB | |
| ✅ | file_to_blackhole_500ms_latency | missed_bytes | 10/10 | 0B = 0B | |
| ✅ | quality_gate_idle | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle | memory_usage | 10/10 | 139.91MiB ≤ 147MiB | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | intake_connections | 10/10 | 3 ≤ 4 | bounds checks dashboard |
| ✅ | quality_gate_idle_all_features | memory_usage | 10/10 | 467.72MiB ≤ 495MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | intake_connections | 10/10 | 4 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_logs | memory_usage | 10/10 | 175.72MiB ≤ 195MiB | bounds checks dashboard |
| ✅ | quality_gate_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | cpu_usage | 10/10 | 349.22 ≤ 2000 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | intake_connections | 10/10 | 3 ≤ 6 | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | memory_usage | 10/10 | 370.03MiB ≤ 430MiB | bounds checks dashboard |
| ✅ | quality_gate_metrics_logs | missed_bytes | 10/10 | 0B = 0B | bounds checks dashboard |
Explanation
Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%
Performance changes are noted in the perf column of each table:
- ✅ = significantly better comparison variant performance
- ❌ = significantly worse comparison variant performance
- ➖ = no significant change in performance
A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".
For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:
- Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
- Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
- Its configuration does not mark it "erratic".
CI Pass/Fail Decision
✅ Passed. All Quality Gates passed.
- quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_idle, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check intake_connections: 10/10 replicas passed. Gate passed.
- quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
- quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
Force-pushed from 6f34723 to 15e6784.
Reopening as a draft from the renamed branch.
For the advanced auto-config experiment. New optional field on integration.Config, populated by the auto_conf_discovery.yaml provider in a follow-up commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Recognise the discovery: block in the file format and populate integration.Config.Discovery. The file is picked up via the existing .yaml extension matcher; only the configFormat struct gains a new field and GetIntegrationConfigFromFile copies it into the returned integration.Config. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hints first (when exposed), then remaining exposed ports in declared order. Dedup-aware. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
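The ordering rule in this commit message can be illustrated with a small sketch. The commit itself implements it in Go; the Python below is only an illustration, and the function name `order_candidate_ports` and its parameters are hypothetical, not taken from the PR.

```python
from typing import Iterable, List


def order_candidate_ports(hints: Iterable[int], exposed: Iterable[int]) -> List[int]:
    """Return ports to probe: exposed hints first, then the remaining exposed ports.

    Hypothetical helper; names and types are assumptions for illustration.
    """
    exposed_list = list(exposed)
    exposed_set = set(exposed_list)
    ordered: List[int] = []
    seen = set()
    # A hint only counts when the service actually exposes that port.
    for port in hints:
        if port in exposed_set and port not in seen:
            seen.add(port)
            ordered.append(port)
    # Remaining exposed ports keep their declared order; duplicates are skipped.
    for port in exposed_list:
        if port not in seen:
            seen.add(port)
            ordered.append(port)
    return ordered
```

For example, with hints `[9090, 1234]` and exposed ports `[8080, 9090, 8080, 9091]`, the hint `1234` is dropped because it is not exposed and the duplicate `8080` is probed only once.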
Per-(serviceID, configHash) cache. Successes never expire; failures expire after caller-supplied TTL. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
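A minimal sketch of the caching policy this commit describes, assuming a simple in-memory map. The Agent-side cache is written in Go; this Python version only illustrates the policy, and `ProbeCache`, its method names, and the injectable clock are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, str]  # (service_id, config_hash)


class ProbeCache:
    """Successes are pinned forever; failures expire after a caller-supplied TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so the expiry logic can be tested
        self._successes: Dict[Key, object] = {}
        self._failures: Dict[Key, float] = {}  # key -> expiry timestamp

    def put_success(self, key: Key, result: object) -> None:
        self._failures.pop(key, None)
        self._successes[key] = result

    def put_failure(self, key: Key, ttl_seconds: float) -> None:
        self._failures[key] = self._clock() + ttl_seconds

    def lookup(self, key: Key) -> Tuple[str, Optional[object]]:
        """Return ("hit", result), ("failed", None), or ("miss", None)."""
        if key in self._successes:
            return "hit", self._successes[key]
        expiry = self._failures.get(key)
        if expiry is not None:
            if self._clock() < expiry:
                return "failed", None
            del self._failures[key]  # TTL elapsed: allow a fresh probe
        return "miss", None
```

A caller would probe only on `"miss"`, which is what lets a failed service be retried once the TTL has elapsed while a successful one is never probed again.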
HTTP-GET each candidate port + path with a 500ms per-probe budget and a 2s overall budget. Verify Content-Type is text/plain or application/openmetrics-text and that the body's first non-comment line is a Prometheus exposition line. Cache success/failure per (serviceID, config hash). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
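The response check described in this commit might look roughly like the sketch below. It covers only the validation step (the HTTP request and the 500ms and 2s budgets are left out), it is written in Python rather than the Go used by the commit, and the function name and the simplified sample-line regular expression are assumptions made for illustration.

```python
import re

ACCEPTED_TYPES = {"text/plain", "application/openmetrics-text"}

# Simplified pattern for one exposition sample line (an assumption, not the
# PR's actual rule): name, optional labels, value, optional integer timestamp.
SAMPLE_LINE = re.compile(
    r"^[a-zA-Z_:][a-zA-Z0-9_:]*"  # metric name
    r"(\{[^}]*\})?"               # optional label set
    r"\s+(\S+)"                   # sample value
    r"(\s+-?\d+)?$"               # optional timestamp
)


def looks_like_prometheus(content_type: str, body: str) -> bool:
    """Check the Content-Type and the first non-comment line of the body."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type not in ACCEPTED_TYPES:
        return False
    for raw in body.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            return False
        try:
            float(match.group(2))  # accepts numbers as well as NaN and +Inf
        except ValueError:
            return False
        return True  # only the first non-comment line is inspected
    return False
```

Checking only the first sample line keeps the probe cheap; a stricter parser would be needed to reject bodies that start well and degrade later.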
Tiny shim so %%discovered_port%% resolution can flow through the existing GetExtraConfig path; no resolver signature change required. Also tightens fakeService.GetExtraConfig in the prober tests to error on unknown keys (matches the contract of real Service impls). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Routes via Resolvable.GetExtraConfig("discovered_port"). Populated by
autodiscovery/discovery's serviceWithProbeResult wrapper after a
successful probe.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a Config has Discovery set, run the OpenMetrics prober against the matched Service before configresolver.Resolve. On match wrap the service so %%discovered_port%% resolves; on no match skip scheduling the check (logged at DEBUG). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SubstituteTemplateEnvVars is called at config-load time with a nil service. Without a nil check, GetDiscoveredPort panicked on res.GetExtraConfig. Match the pattern used by GetPort/GetPid/ GetHostname: return a NoResolverError early when res is nil so the caller can ignore it (config_reader.go:517 already does). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…plan Cross-language plan (Go + C++ + Python) for the Agent-side infrastructure that calls a Python discover() classmethod via rtloader, replacing the existing krakend-experiment Go prober and %%discovered_port%% template var. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
autoconfig.go calls discoverer.NewPythonBridge() unconditionally; without this stub the symbol is undefined in builds where the python tag is absent (e.g. cluster agent). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Records the exact build + bind-mount sequence that successfully validates the Plan B implementation against a real krakend container. Includes the pitfalls hit during the manual run (Python ABI mismatch, RUNPATH/RPATH bind mounts, conf.d vs data/ confusion, Python init race) so an automated harness can avoid each one. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous commit accidentally added "py" to ruff's exclude list to work around a pre-commit hook failure on a transient local working-tree directory. The directory is gone; revert the config change.
Surfaces ErrPythonNotReady from the Python bridge when rtloader has not yet initialised, and skips the negative cache for that error so the next AD reconcile event re-attempts the probe. Fixes a startup race where AD reconciles before Python init completes (~30s gap), caches the failure, and never re-probes in stable conditions — the krakend e2e smoke test previously had to bounce the target container to clear the cache. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves the AD-vs-Python-init startup race for advanced auto-config templates. Previously, AutoDiscovery's first reconcile fired before rtloader.Initialize completed; the discoverer returned ErrPythonNotReady (uncached after the previous fix) and no future event triggered a retry in stable conditions, so the integration's check was never scheduled without manually bouncing the target container.
- pkg/collector/python: signalPythonReady closes a once-channel at the end of Initialize; WaitReady blocks on it.
- discoverer.WaitForPython is the public entry point (with a no-op stub for builds without the python tag, so cluster-agent compiles cleanly).
- configmgr.rescanDiscoveryTemplates iterates active services with Discovery templates and re-runs reconcileService for each.
- AutoConfig.start launches a fire-and-forget goroutine that waits for Python to be ready and then runs the rescan.
The bridge MUST NOT block on Python init in the AD reconcile path: fx hooks are sequential and that would deadlock against the very hook that triggers Initialize.
Verified end-to-end against the krakend tests/docker compose: krakend check is now scheduled ~9 s after agent start without any manual container bounce, sourcing http://<container-ip>:9090/metrics from the Python discover() result.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the manual krakend-bounce step now that AutoConfig automatically re-reconciles services with discovery templates once Python is ready. Adds a note on the "skipped — python not yet ready" startup log being expected and benign, plus the dev/lib rtloader restore step (needed after every agent rebuild because cmake links against host Python 3.12). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`dda inv agent.build` re-links rtloader against the host's python3.X-dev headers and overwrites the bazel-built .so files in dev/lib/. The resulting agent fails inside the discovery-dev image with `libpython3.12.so.1.0: cannot open shared object file` because the container ships Python 3.13. Detect this by extracting the libpython version the rtloader is linked against and confirming the matching libpython exists in dev/embedded/lib/ (where bazel installs it). Fail with the exact remediation commands instead of letting the user discover the issue inside the running agent container.
This reverts commit 7a95910. The rescan-on-Python-ready mechanism is being replaced by an in-bridge lazy InitPython that mirrors the python check loader's existing convention (loader.go: pythonOnce.Do(InitPython) when python_lazy_loading is true). The lazy-init shape is simpler, also fixes the CLI agent check subcommand (which hits the same race in a fresh process), and removes ~111 lines of one-shot recovery plumbing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors the python check loader convention (loader.go: pythonOnce.Do + InitPython when python_lazy_loading is true). The discoverer is just another consumer that needs Python; it runs init on demand if no earlier consumer has done so. This fixes the AD-vs-Python startup race for both the agent runtime path AND the CLI 'agent check' subcommand. The previous rescan-on-ready approach handled only the running-agent case (a fresh process re-runs discovery from scratch and never gets a future event to trigger the rescan). The pythonOnce sync.Once shared with the loader makes init idempotent across all callers. python_lazy_loading defaults to true; in eager mode the collector still inits Python in its constructor and the discoverer's check is a no-op. Verified end-to-end against the krakend tests/docker compose: no "skipped — python not yet ready" log, single straight-through "Initializing rtloader" triggered by the discoverer ~6 s after agent start, krakend check [OK] with 84 metrics/run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops the "skipped — python not yet ready" log discussion and the rescan-goroutine description in favour of the new straight-through lazy-init path: the discoverer triggers InitPython via pythonOnce, and the krakend check appears [OK] within ~10 s of agent start. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Generalises the krakend experiment into a reusable advanced-autoconfig path: rather than hard-coding an OpenMetrics prober and a `%%discovered_port%%` template variable in Go, the Agent now hands the probe decision to a Python `discover(cls, service)` classmethod on the integration's check class via a new rtloader bridge. The Python side decides what (if anything) to schedule and returns the fully-resolved instance configs back across the boundary.
This replaces the krakend-specific Go prober from the earlier revision of this PR with infrastructure any integration can opt into by:
- Shipping an `auto_conf_discovery.yaml` (`ad_identifiers:` + `discovery: {}` presence marker, plus an `instances:` template that the Python side may override).
- Defining a `discover(cls, service)` classmethod on its check class that returns `list[dict]` (or `None` on no match).
Tracks Confluence ticket DSCVR/6650004331.
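To make the opt-in contract above concrete, here is a minimal sketch of the Python side. Everything in it is hypothetical: the `Service` fields, the `ExampleCheck` class, the `run_discover` helper, and the instance key are assumptions for illustration and are not the actual `datadog_checks_base` API. A real `discover()` would typically probe the service before returning anything.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Service:
    # Hypothetical shape; the real dataclass is defined in datadog_checks_base.
    id: str
    host: str
    ports: List[int] = field(default_factory=list)


class ExampleCheck:
    # Hypothetical check class, not a real integration.
    METRICS_PORT = 9090

    @classmethod
    def discover(cls, service: Service) -> Optional[List[Dict]]:
        """Return fully resolved instance configs, or None on no match."""
        if cls.METRICS_PORT not in service.ports:
            return None  # no match: nothing gets scheduled
        url = f"http://{service.host}:{cls.METRICS_PORT}/metrics"
        return [{"openmetrics_endpoint": url}]  # illustrative instance key


def run_discover(check_class, service_json: str) -> str:
    """JSON in, JSON out: the round trip a bridge helper could perform."""
    service = Service(**json.loads(service_json))
    return json.dumps(check_class.discover(service))
```

Calling `run_discover(ExampleCheck, '{"id": "svc", "host": "10.0.0.5", "ports": [9090]}')` yields a JSON list with one instance, while a service without port 9090 yields `null`, which the caller can treat as "do not schedule".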
Companion PR (Plan A helpers in `datadog_checks_base.utils.discovery`, the `_run_discover` Python bridge helper, and the krakend `discover()` migration): DataDog/integrations-core#23547 (branch `vitkyrka/disco-autoconfig`).
Implementation plan: `docs/superpowers/plans/2026-05-06-discover-agent-bridge.md`.
What's in this PR
- New file format (`auto_conf_discovery.yaml`): picked up by the existing file config provider (`comp/core/autodiscovery/providers/config_reader.go`). A non-nil `discovery:` block on `integration.Config` is the presence marker that this is a discovery template; the per-integration logic lives entirely on the Python side.
- `comp/core/autodiscovery/discoverer/` package: Go orchestration.
  - `Discoverer` / `Bridge` interfaces, decoupled from rtloader for testability.
  - `defaultDiscoverer` marshals the matched `listeners.Service` to JSON, calls the bridge, and converts the returned list-of-dicts into `integration.Config` values to schedule.
  - Cache keyed on `(serviceID, integration_name)`; successes pinned, failures expire after 30s.
  - `ErrPythonNotReady` is treated as transient and not cached, so the next AD reconcile retries instead of sitting on a stale failure for the TTL.
- rtloader `run_discover` bridge: new pure-virtual `RtLoader::runDiscover`, `Three::runDiscover` implementation (`rtloader/three/three.cpp`), C export in `rtloader/rtloader/api.cpp`, and the cgo wrapper `pkg/collector/python/discover.go`. The bridge calls `datadog_checks.base.utils.discovery._run_discover(check_class, service_json)`, which builds a `Service` dataclass, invokes `cls.discover(service)`, and returns the JSON-encoded result.
- Lazy Python init from the bridge: mirrors the python check loader's existing `pythonOnce.Do(InitPython)` convention. Fixes the AD-vs-Python startup race for both the running agent and the `agent check` CLI subcommand without the rescan-on-ready plumbing that an earlier iteration of this branch carried (and that has since been reverted; see commits `7a95910` then `4c09170`).
- AD reconcile path: `configmgr` runs the discoverer before `configresolver.Resolve` whenever a template's `Discovery` field is set; on no-match the check is not scheduled (logged at DEBUG); on match the resolved instances are scheduled directly without going through any template-variable substitution.
- Removed (vs. earlier revision of this PR): the Go OpenMetrics prober, the `serviceWithProbeResult` wrapper, and the `%%discovered_port%%` template variable. The probe logic and any port hint handling now live in Python (krakend's `discover()` uses the `http_probe` + `is_prometheus_exposition` helpers from `datadog_checks_base`).
dev/e2e tooling
- `tasks/discovery_dev.py` + `test/dockerfiles/discovery-dev/`: `dda inv discovery-dev.build-image` produces an agent image with the dev tree bind-mounted, with a guard that fails fast when `dda inv agent.build` has re-linked rtloader against the host's libpython (the container ships Python 3.13).
- `docs/superpowers/2026-05-06-discover-e2e-smoke.md`: manual smoke procedure (full build + bind-mount sequence) used to validate end-to-end against a real krakend container; intended as the basis for an automated harness.
Test plan
- `dda inv test --targets=./comp/core/autodiscovery/...,./pkg/collector/python`: unit tests pass (discoverer with fake bridge, cache, providers, integration config).
- `dda inv linter.go`: clean on touched packages.
- `bazel build //rtloader/...`: C++ bridge builds; `Three::runDiscover` exercised through agent build.
- `docs/superpowers/2026-05-06-discover-e2e-smoke.md`: agent comes up, lazy-init triggers ~6s in, krakend check goes `[OK]` with 84 metrics/run sourcing `http://<container-ip>:9090/metrics` from the Python `discover()` result.
- `python_bridge_nopython.go` stub keeps `discoverer.New(nil)` compiling and resolves discovery templates fail-closed.
Known limitation (carried forward)
The discoverer call still runs while the configManager mutex is held, which serialises service reconciliation while Python is running. This is acceptable for the experiment, but the call should move outside the lock (or become async) before broadening adoption.
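One way the call could move outside the lock is sketched below, purely as an illustration of the pattern and not as a description of the Agent's Go code. The class `ConfigManagerSketch`, its methods, and the generation counter used to detect stale results are all hypothetical.

```python
import threading
from typing import Callable, Dict, List, Optional


class ConfigManagerSketch:
    """Hypothetical manager that does not hold its lock during discovery."""

    def __init__(self, discover: Callable[[str], Optional[List[dict]]]) -> None:
        self._lock = threading.Lock()
        self._discover = discover
        self._generation: Dict[str, int] = {}
        self.scheduled: Dict[str, List[dict]] = {}

    def add_service(self, service_id: str) -> None:
        with self._lock:
            self._generation[service_id] = self._generation.get(service_id, 0) + 1

    def remove_service(self, service_id: str) -> None:
        with self._lock:
            self._generation.pop(service_id, None)
            self.scheduled.pop(service_id, None)

    def reconcile(self, service_id: str) -> bool:
        # 1. Snapshot the state needed for discovery while holding the lock.
        with self._lock:
            generation = self._generation.get(service_id)
        if generation is None:
            return False
        # 2. Run the slow discovery call with the lock released.
        configs = self._discover(service_id)
        # 3. Re-acquire the lock and apply only if the service is unchanged.
        with self._lock:
            if self._generation.get(service_id) != generation:
                return False  # stale result: the service changed meanwhile
            if not configs:
                return False
            self.scheduled[service_id] = configs
            return True
```

The trade-off is that results can arrive for a service that has since been removed or replaced, so the apply step has to re-validate before scheduling; the generation check is one simple way to do that.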
🤖 Generated with Claude Code